Abstract

We consider the problem of learning from distributed data in the agnostic setting, i.e., in the presence 
of arbitrary forms of noise. Our main contribution is a general distributed boosting-based procedure for 
learning an arbitrary concept space, that is simultaneously noise tolerant, communication efficient, and 
computationally efficient. This improves significantly over prior works that were either communication 
efficient only in noise-free scenarios or computationally prohibitive. Empirical results on large synthetic 
and real-world datasets demonstrate the effectiveness and scalability of the proposed approach. 


1 Introduction 

Distributed machine learning has received an increasing amount of attention in this “big data” era [16]. The
most common use case of distributed learning is when the data cannot fit into a single machine, or when one
wants to speed up the training process by utilizing parallel computation of multiple machines [21, 25, 26].

In these cases, one can usually freely distribute the data across entities, and an evenly distributed partition 
would be a natural choice. 

In this paper, we consider a different setting where the data is inherently distributed across different locations
or entities. Examples of this scenario include scientific data gathered by different teams, or customer
information of a multinational corporation obtained in different countries. The goal is to design an efficient 
learning algorithm with a low generalization error over the union of the data. Note that the distribution of the 
data from each source may be very different. Therefore, to deal with the worst-case situation, we assume the 
data can be adversarially partitioned. This scenario has been studied for different tasks, such as supervised 
learning (US]EL unsupervised learning tUO, and optimization ll6l fT5l . 

Traditional machine learning algorithms often only care about sample complexity and computational
complexity. However, since the bottleneck in the distributed setting is often the communication between
machines [1], the theoretical analysis in this paper will focus on communication complexity. A baseline
approach in this setting would be to uniformly sample examples from each entity and perform centralized
learning at the center. By the standard VC-theory, a sample of size $O(\frac{d}{\epsilon^2} \log \frac{1}{\epsilon})$ is sufficient, where $d$ is the
VC-dimension of the hypothesis class. The communication complexity of this approach is thus $O(\frac{d}{\epsilon^2} \log \frac{1}{\epsilon})$ examples.

More advanced algorithms with better communication complexities have been proposed in recent works [1, 8].
For example, [1] proposes a generic distributed boosting algorithm that achieves communication with
only logarithmic dependence on $1/\epsilon$ for any concept class. Unfortunately, their method only works in the
standard realizable PAC-learning setting, where the data is noiseless and can be perfectly classified by a
function in the hypothesis set. This is because many boosting algorithms are vulnerable to noise [9, 22].
The realizable case is often unrealistic in real-world problems. Therefore, we consider the more general
agnostic learning setting [20], where there is no assumption on the target function. Since it is impossible
to achieve an arbitrary error rate $\epsilon$ in this setting, the goal is to find a hypothesis with error rate close to
$\mathrm{opt}(H)$, the minimum error rate achievable within the hypothesis set $H$. The error bound is often of the
form $O(\mathrm{opt}(H)) + \epsilon$. Balcan et al. [1] propose an algorithm based on the robust generalized halving
algorithm with communication complexity of $O(k \log(|H|) \log(1/\epsilon))$ examples. However, the algorithm
works only for a finite hypothesis set $H$ and is computationally inefficient.

We propose a new distributed boosting algorithm that works in the agnostic learning setting. While our
algorithm can handle this much more difficult and more realistic scenario, it enjoys the same communication
complexity as in [1], which is logarithmic in $1/\epsilon$ and exponentially better than the natural baselines. The
algorithm is computationally efficient and works for any concept class with a finite VC-dimension. The key
insight, inspired by [1], is that a constant (independent of $\epsilon$) number of examples suffices to learn a weak
hypothesis, and thus if the boosting algorithm only needs $O(\log \frac{1}{\epsilon})$ iterations, we obtain the desired result.

A key challenge in this approach is that most agnostic boosting algorithms either have poor error
bound guarantees or require too many iterations. The first agnostic boosting algorithm was proposed in
[5]. Although its number of iterations is $O(\log \frac{1}{\epsilon})$ and is asymptotically optimal, its bound on the final
error rate is much weaker: instead of $O(\mathrm{opt}(H)) + \epsilon$, the bound is $O(\mathrm{opt}(H)^{c(\beta)}) + \epsilon$, where
$c(\beta) = 2(1/2 - \beta)^2 / \ln(1/\beta - 1)$. Some subsequent works [19, 13] significantly improve the bound on the error
rate. However, their algorithms all require $O(1/\epsilon^2)$ iterations, which can in turn result in $O(1/\epsilon^2)$
communication in the distributed setting. Fortunately, we identify a very special boosting algorithm [18] that runs
in $O(\log \frac{1}{\epsilon})$ iterations. This algorithm was analyzed in the realizable case in the original paper, but has later
been noted to work in the agnostic setting [10].¹ We show how to adapt it to the distributed
setting and obtain a communication-efficient distributed learning algorithm with a good agnostic learning error
bound. Our main contributions are summarized as follows.

• We identify a centralized agnostic boosting algorithm and show that it can be elegantly adapted to 
the distributed setting. This results in the first algorithm that is both computationally efficient and 
communication efficient to learn a general concept class in the distributed agnostic learning setting. 

• Our proposed algorithm, which is a boosting-based approach, is flexible in that it can be used with 
various weak learners. Furthermore, the weak learner only needs to work in the traditional centralized 
setting rather than in the more challenging distributed setting. This makes it much easier to design 
new algorithms for different concept classes in the distributed setting. 

• We confirm our theoretical results by empirically comparing our algorithm to the existing distributed
boosting algorithm [1]. It does much better on the synthetic dataset and achieves promising results on
real-world datasets as well.

¹ In the prior version of this paper, which appeared in AISTATS 2016, we claimed that we were the first to show its guarantees
in the agnostic setting. We thank the correction from the author of m that, although not explicitly proved or shown as a theorem, 
the feasibility of the algorithm in the agnostic setting has already been discussed in [10].

2 Problem Setup


We first introduce agnostic learning as a special case of the general statistical learning problem. Then, we 
discuss the extension of the problem to the distributed setting, where the data is adversarially partitioned. 

2.1 Statistical learning problem 

In statistical learning, we have access to a sampling oracle according to some probability distribution $D$
over $X \times \{-1, 1\}$. The goal of a learning algorithm is to output a hypothesis $h$ with a low error rate with
respect to $D$, defined as $\mathrm{err}_D(h) = \mathbb{E}_{(x,y) \sim D}\,\mathbf{1}[h(x) \neq y]$. Often, we compare the error rate to the minimum
achievable value within a hypothesis set $H$, denoted by $\mathrm{err}_D(H) = \inf_{h' \in H} \mathrm{err}_D(h')$. More precisely, a
common error bound has the following form:

$\mathrm{err}_D(h) \le c \cdot \mathrm{err}_D(H) + \epsilon$,    (1)

for some constant $c \ge 1$ and an arbitrary error parameter $\epsilon > 0$.

Many efficient learning algorithms have been proposed for the realizable case, where the target function
is in $H$ and thus $\mathrm{err}_D(H) = 0$. In this paper, we consider the more general case where we do not make any
assumption on the value of $\mathrm{err}_D(H)$. This is often called the agnostic learning setting [20]. Ideally, we
want $c$ in the bound to be as close to one as possible. However, for some hypothesis sets $H$, achieving such
a bound with $c = 1$ is known to be NP-hard [11].

2.2 Extension to the distributed setting 

In this work, we consider the agnostic learning problem in the distributed learning framework proposed by
[1]. In this framework, we have $k$ entities. Each entity $i \in [k]$ has access to a sampling oracle according
to a distribution $D_i$ over $X \times \{-1, 1\}$. There is also a center which can communicate with the $k$ entities
and acts as a coordinator. The goal is to learn a good hypothesis with respect to the overall distribution
$D = \frac{1}{k} \sum_{i=1}^{k} D_i$ without too much communication among entities. It is convenient to measure the
communication in words. For example, a $d$-dimensional vector counts as $O(d)$ words.

Main goal. The problem we want to solve in this paper is to design an algorithm that achieves error
bound (1) for a general concept class $H$. The communication complexity should depend only logarithmically
on $1/\epsilon$.

3 Distributed agnostic boosting 

In this work, we show a distributed boosting algorithm for any concept class with a finite VC-dimension
$d$. In the realizable PAC setting, the boosting algorithm is assumed to have access to a $\gamma$-weak learner that,
under any distribution, finds a hypothesis with error rate at most $1/2 - \gamma$. This assumption is unrealistic
in the agnostic setting since even the best hypothesis in the hypothesis set can perform poorly. Instead,
following the setting of [5], the boosting algorithm is assumed to have access to a $\beta$-weak agnostic learner
defined as follows.

Definition 1. A $\beta$-weak agnostic learner, given any probability distribution $D$, will return a hypothesis $h$
with error rate

$\mathrm{err}_D(h) \le \mathrm{err}_D(H) + \beta$.

Detailed discussion of the existence of such weak learners can be found in [5]. Since an error of 1/2 can be
trivially achieved, in order for the weak learner to convey meaningful information, we assume
$\mathrm{err}_D(H) < 1/2 - \beta$. Some prior works use different definitions. For example, [11] uses the definition of an $(\alpha, \gamma)$-weak
learner. That definition is stronger than ours, since an $(\alpha, \gamma)$-weak learner in that paper implies a $\beta$-weak
learner in our paper with $\beta = \alpha - \gamma$. Therefore, our results still hold under their definition. Below we
show an efficient agnostic boosting algorithm in the centralized setting.
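To make the notion of a weak learner concrete, the following Python sketch performs weighted empirical risk minimization over decision stumps (the weak learners used later in the experiments). The function names `weighted_error` and `stump_learner` are hypothetical and introduced only for illustration. Whether such a learner satisfies Definition 1 for a given $\beta$ depends on the hypothesis class and on how well the sample represents the distribution, so this should be read as an illustration rather than as the implementation used in the paper.

```python
def weighted_error(h, sample, dist):
    """Weighted error of hypothesis h: total weight of misclassified examples."""
    return sum(p for (x, y), p in zip(sample, dist) if h(x) != y)


def stump_learner(sample, dist):
    """Return (stump, error) minimizing the weighted error over all decision stumps.

    sample: list of (x, y) with x a tuple of numbers and y in {-1, +1}
    dist:   list of non-negative example weights summing to 1
    """
    dim = len(sample[0][0])
    best, best_err = None, float("inf")
    for j in range(dim):
        for thr in sorted({x[j] for x, _ in sample}):
            for sgn in (1, -1):
                # Stump: predict sgn when feature j is at least thr, else -sgn.
                h = lambda x, j=j, thr=thr, sgn=sgn: sgn if x[j] >= thr else -sgn
                err = weighted_error(h, sample, dist)
                if err < best_err:
                    best, best_err = h, err
    return best, best_err


# Usage: four points on a line, separable by a threshold at 2.
sample = [((0.0,), -1), ((1.0,), -1), ((2.0,), 1), ((3.0,), 1)]
h, err = stump_learner(sample, [0.25] * 4)
```

Exhaustive search over thresholds is quadratic in the sample size here; it is chosen for readability, not efficiency.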

3.1 Agnostic boosting: centralized version 

The main reason why many boosting algorithms (including AdaBoost [12] and weight-based boosting [23,
24]) fail in the agnostic setting is that they tend to update the example weights aggressively and may end up
putting too much weight on noisy examples.

To overcome this, we consider a smoothed boosting algorithm [18], shown in Algorithm 1. This
algorithm uses at most $O(\log(1/\epsilon))$ iterations and enjoys a nice “smoothness” property, which is shown to be
helpful in the agnostic setting [13]. The algorithm was originally analyzed in the realizable case but has later
been noted to work in the agnostic setting [10]. Below, for completeness, we show the analyses of
the algorithm in both the realizable and agnostic settings.

The boosting algorithm adjusts the example weights using the standard multiplicative weight update
rule. The main difference is that it performs an additional Bregman projection step of the current example
weight distribution into a convex set $V$ after each boosting iteration. The Bregman projection is a general
projection technique that finds a point in the feasible set with the smallest “distance” to the original point
in terms of Bregman divergence. Here we use a particular Bregman divergence called relative entropy,
$\mathrm{RE}(p \,\|\, q) = \sum_i p_i \ln(p_i / q_i)$ for two distributions $p$ and $q$. To ensure that the boosting algorithm always
generates a “smooth” distribution, we set the feasible set $V$ to be the set of all $\epsilon$-smooth distributions, which
is defined as follows.

Definition 2. A distribution $D$ on $S$ is called $\epsilon$-smooth if $\max_i D(i) \le \frac{1}{\epsilon |S|}$.

It is easy to verify that $V$ is a convex set. The complete boosting algorithm is shown in Algorithm 1
and the theoretical guarantee in Theorem 1. The proof, included in the appendix, is similar to the one in
[18], except that they use real-valued weak learners, whereas here we only consider binary hypotheses for
simplicity.
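The two definitions above translate directly into code. The sketch below (with the hypothetical helper names `relative_entropy` and `is_smooth`, and a small numerical tolerance that is an assumption of this illustration) computes the relative entropy between two distributions and checks $\epsilon$-smoothness, reading the smoothness cap as $1/(\epsilon |S|)$.

```python
import math


def relative_entropy(p, q):
    """RE(p || q) = sum_i p_i ln(p_i / q_i), using the convention 0 ln 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def is_smooth(dist, eps, tol=1e-12):
    """True if no example carries more than 1 / (eps * |S|) probability mass."""
    return max(dist) <= 1.0 / (eps * len(dist)) + tol


# With |S| = 4 and eps = 0.5 the cap is 0.5 per example.
uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(is_smooth(uniform, 0.5), is_smooth(skewed, 0.5))
```

The uniform distribution is $\epsilon$-smooth for every $\epsilon \le 1$, which is why the boosting algorithm can always start from it.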

Theorem 1. Given a sample $S$ and access to a $\gamma$-weak learner, Algorithm 1 makes at most $T = O(\frac{\log(1/\epsilon)}{\gamma^2})$
calls to the weak learner with $\epsilon$-smooth distributions and achieves error rate $\epsilon$ on $S$.

Note that Theorem 1 does not explicitly assume the realizable case. In other words, if we
have a $\gamma$-weak learner in the agnostic setting, we can achieve the same guarantee. However, in the agnostic
setting, we only have access to a $\beta$-weak agnostic learner, which is a much weaker and more realistic
assumption. The next theorem shows the error bound we get under this usual assumption in the agnostic
setting.

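Theorem 2. Given a sample $S$ and access to a $\beta$-weak agnostic learner, Algorithm 1 uses at most
$O(\frac{\log(1/\epsilon)}{(1/2 - \beta)^2})$ iterations and achieves an error rate $\frac{2\,\mathrm{err}_S(H)}{1/2 - \beta} + \epsilon$ on $S$, where $\mathrm{err}_S(H)$ is the optimal
error rate on $S$ achievable using the hypothesis class $H$.

Proof. The idea is to show that as long as the boosting algorithm always generates $\epsilon'$-smooth
distributions, the $\beta$-weak agnostic learner is actually a $\gamma$-weak learner for some $\gamma > 0$, i.e., it achieves error
rate $1/2 - \gamma$ for any $\epsilon'$-smooth distribution. In each iteration $t$, the $\beta$-weak agnostic learner, given $S$ with
distribution $D^{(t)}$, returns a hypothesis $h^{(t)}$ such that

$\mathrm{err}_{D^{(t)}}(h^{(t)}) \le \mathrm{err}_{D^{(t)}}(H) + \beta \le \frac{1}{\epsilon'}\,\mathrm{err}_S(H) + \beta$.

The second inequality uses the $\epsilon'$-smoothness of $D^{(t)}$. The reason is that if $h^*$ is the optimal
hypothesis on $S$, we have

$\mathrm{err}_{D^{(t)}}(H) \le \mathrm{err}_{D^{(t)}}(h^*) \le \frac{\#\text{mistakes of } h^* \text{ on } S}{\epsilon' |S|} = \frac{1}{\epsilon'}\,\mathrm{err}_S(h^*) = \frac{1}{\epsilon'}\,\mathrm{err}_S(H)$.

Let $\frac{1}{\epsilon'}\,\mathrm{err}_S(H) + \beta = \frac{1}{2} - \gamma$, or equivalently $\gamma = (\frac{1}{2} - \beta) - \frac{1}{\epsilon'}\,\mathrm{err}_S(H)$. Then, if $\epsilon' \ge \frac{2\,\mathrm{err}_S(H)}{1/2 - \beta}$, we
have $\gamma \ge \frac{1}{2}(1/2 - \beta) > 0$. Therefore, we can use Theorem 1 and achieve error rate $\epsilon'$ on $S$ by using
$O(\frac{\log(1/\epsilon')}{(1/2 - \beta)^2})$ iterations. Alternatively, setting $\epsilon' = \frac{2\,\mathrm{err}_S(H)}{1/2 - \beta} + \epsilon$, it achieves error rate
$\frac{2\,\mathrm{err}_S(H)}{1/2 - \beta} + \epsilon$ by using $O(\frac{\log(1/\epsilon)}{(1/2 - \beta)^2})$ iterations. □

Algorithm 1 Centralized Smooth Boosting algorithm [18]

Initialization: Fix a $\gamma$. Let $D^{(1)}$ be the uniform distribution over the dataset $S$.

for $t = 1, 2, \dots, T$ do

    1. Call the weak learner with distribution $D^{(t)}$ and obtain a hypothesis $h^{(t)}$

    2. Update the example weights

           $\hat{D}^{(t+1)}(i) = \frac{D^{(t)}(i) \cdot (1 - \gamma)^{\ell_i^{(t)}}}{Z^{(t)}}$,

       where $\ell_i^{(t)} = \mathbf{1}[h^{(t)}(x_i) = y_i]$ and $Z^{(t)} = \sum_i D^{(t)}(i) \cdot (1 - \gamma)^{\ell_i^{(t)}}$ is the normalization factor.

    3. Project $\hat{D}^{(t+1)}$ into the feasible set $V$ of $\epsilon$-smooth distributions

           $D^{(t+1)} = \arg\min_{D \in V} \mathrm{RE}(D \,\|\, \hat{D}^{(t+1)})$

end for

Output: The hypothesis $\mathrm{sign}\left(\frac{1}{T} \sum_{t=1}^{T} h^{(t)}\right)$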
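The three steps of the centralized smooth boosting algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the smoothness cap is taken to be $1/(\epsilon n)$, the projection uses the sort-based clip-and-rescale procedure described in Section 3.2, and the weak learner `erm_over` (weighted ERM over a fixed pool of threshold functions) is a hypothetical stand-in for a real weak learner. All function names are introduced here for illustration only.

```python
def project_smooth(p, eps):
    """Clip-and-rescale projection onto eps-smooth distributions (entries of p must be positive)."""
    n = len(p)
    cap = 1.0 / (eps * n)
    order = sorted(range(n), key=lambda i: -p[i])      # indices by decreasing weight
    rest = sum(p)                                      # mass of the coordinates not clipped
    for m in range(n):
        scale = (1.0 - m * cap) / rest
        if p[order[m]] * scale <= cap + 1e-12:         # largest unclipped coordinate fits
            clipped = set(order[:m])
            return [cap if i in clipped else p[i] * scale for i in range(n)]
        rest -= p[order[m]]
    return [1.0 / n] * n                               # eps == 1 up to rounding: uniform


def smooth_boost(sample, weak_learner, gamma, eps, rounds):
    """Run `rounds` boosting iterations on `sample`, a list of (x, y) with y in {-1, +1}."""
    n = len(sample)
    dist = [1.0 / n] * n
    hypotheses = []
    for _ in range(rounds):
        h = weak_learner(sample, dist)                 # step 1: call the weak learner
        hypotheses.append(h)
        # Step 2: multiply the weight of correctly classified examples by (1 - gamma).
        dist = [d * (1.0 - gamma) if h(x) == y else d for (x, y), d in zip(sample, dist)]
        total = sum(dist)
        # Step 3: normalize, then project back onto the eps-smooth distributions.
        dist = project_smooth([d / total for d in dist], eps)
    return lambda x: 1 if sum(h(x) for h in hypotheses) >= 0 else -1


def erm_over(pool):
    """Hypothetical weak learner: pick the pool member with the smallest weighted error."""
    def learn(sample, dist):
        return min(pool, key=lambda h: sum(d for (x, y), d in zip(sample, dist) if h(x) != y))
    return learn


# Toy data: threshold at 5 on the integers 0..9, with the label of x = 0 flipped.
sample = [(x, 1 if x >= 5 else -1) for x in range(10)]
sample[0] = (0, 1)
pool = [lambda x, t=t, s=s: s if x >= t else -s for t in range(10) for s in (1, -1)]
f = smooth_boost(sample, erm_over(pool), gamma=0.1, eps=0.5, rounds=20)
mistakes = [x for x, y in sample if f(x) != y]
```

In this toy run the noisy example can never hold more than $1/(\epsilon n) = 0.2$ of the weight, so the clean threshold remains the best weak hypothesis in every round and the final classifier errs only on the flipped point.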

Next, we show how to adapt this algorithm to the distributed setting. 


3.2 Agnostic boosting: distributed version 

The technique of adapting a boosting algorithm to the distributed setting is inspired by [1]. They claim
that any weight-based boosting algorithm can be turned into a distributed boosting algorithm with
communication complexity that depends linearly on the number of iterations in the original boosting algorithm.
However, their result is not directly applicable to our boosting algorithm due to the additional projection
step. We describe our distributed boosting algorithm by showing how to simulate the three steps in each
iteration of Algorithm 1 in the distributed setting with $O(d)$ words of communication. Then, since there are
at most $O(\log(1/\epsilon))$ iterations, the desired result follows.

In step 1, in order to obtain a $2\beta$-weak hypothesis (we use $2\beta$ instead of $\beta$ for convenience, which
only affects the constant terms), the center calls the $\beta$-weak agnostic learner on a dataset sampled from
$D^{(t)} = \frac{1}{k} \sum_{i=1}^{k} D_i^{(t)}$. The sampling procedure is as follows. Each entity first sends its sum of weights to
the center. Then, the center samples $O(\frac{d}{\beta^2} \log \frac{1}{\beta})$ examples in total across the $k$ entities, proportionally to their
sums of weights. By the standard VC-theory, the error rate of any hypothesis on the sample is within $\beta$
of its true error rate with respect to the underlying distribution, with high probability. It is thus sufficient
to find a hypothesis with error within $\beta$ of the best hypothesis, which can be done thanks to the assumed
$\beta$-weak learner.
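The sampling procedure in step 1 can be sketched as a two-level draw: the center first decides how many examples to request from each entity, and each entity then samples locally according to its own example weights. The Python sketch below simulates both sides in one process. The function names `allocate_requests` and `draw_local` are hypothetical, and sampling with replacement via `random.choices` is an assumption of this illustration.

```python
import random


def allocate_requests(entity_weights, num_samples, rng):
    """Center side: draw num_samples entity indices with probability w_i / W."""
    counts = [0] * len(entity_weights)
    for i in rng.choices(range(len(entity_weights)), weights=entity_weights, k=num_samples):
        counts[i] += 1
    return counts


def draw_local(examples, weights, count, rng):
    """Entity side: sample `count` examples proportionally to their current weights."""
    return rng.choices(examples, weights=weights, k=count) if count else []


# Usage: three entities, the second of which currently holds no weight.
rng = random.Random(0)
counts = allocate_requests([3.0, 0.0, 1.0], 200, rng)
batch = draw_local(["a", "b", "c"], [0.5, 0.25, 0.25], counts[0], rng)
```

Only the $k$ weight sums, the $k$ request counts, and the sampled examples cross the network, which is what keeps the per-round communication independent of the local dataset sizes.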

Step 2 is relatively straightforward. The center broadcasts $h^{(t)}$ and each entity updates its own internal
weights independently. Each entity then sends the sum of its internal weights to the center for the
calculation of the normalization factor. The communication in this step is $O(kd)$ for sending $h^{(t)}$ and some
numbers. What is left is to show that the projection in step 3 can be done in a communication-efficient
way. As shown in [14], the projection into $V$, the set of all $\epsilon$-smooth distributions, using relative entropy
as the distance can be done by the following simple algorithm.

For a fixed index $m$, we first clip the largest $m$ coordinates of $p$ to $\frac{1}{\epsilon n}$, and then rescale the rest of the
coordinates to sum up to $1 - \frac{m}{\epsilon n}$. We find the least index $m$ such that the resulting distribution is in $V$, i.e.,
all the coordinates are at most $\frac{1}{\epsilon n}$. A naive algorithm that first sorts the coordinates takes $O(n \log n)$ time,
but it is communication-inefficient.
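The naive centralized version of this projection can be sketched in a few lines of Python. The sketch assumes that the cap is $1/(\epsilon n)$, that all entries of $p$ are strictly positive, and that $\epsilon \le 1$; the function name `clip_and_rescale` and the numerical tolerance are choices made for this illustration.

```python
def clip_and_rescale(p, eps):
    """Return (q, m): the projected distribution and the number of clipped coordinates."""
    n = len(p)
    cap = 1.0 / (eps * n)
    order = sorted(range(n), key=lambda i: p[i], reverse=True)   # largest weights first
    rest = sum(p)                       # total mass of the coordinates that are not clipped
    for m in range(n):
        scale = (1.0 - m * cap) / rest  # rescale the rest to sum to 1 - m * cap
        if p[order[m]] * scale <= cap + 1e-12:
            # The largest unclipped coordinate fits under the cap, so all of them do.
            q = [0.0] * n
            for rank, i in enumerate(order):
                q[i] = cap if rank < m else p[i] * scale
            return q, m
        rest -= p[order[m]]
    return [1.0 / n] * n, n             # only reachable through rounding when eps == 1


# With n = 4 and eps = 0.5 the cap is 0.5, so only the first coordinate is clipped.
q, m = clip_and_rescale([0.7, 0.1, 0.1, 0.1], 0.5)
```

Trying $m = 0, 1, 2, \dots$ in order returns the least feasible $m$, as in the description above. The sort requires all weights in one place, which is why the distributed version replaces it with median finding.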

Fortunately, [14] also proposes a more advanced algorithm that recursively finds the median. The idea
is to use the median as the threshold, which corresponds to a potential index $m$, i.e., $m$ is the number of
coordinates larger than the median. We then use a binary search to find the least index $m$. The distributed
version of the algorithm is shown in Algorithm 2.

Theorem 3. Algorithm 2 projects an $n$-dimensional distribution into the set of all $\epsilon$-smooth distributions $V$
with $O(k \log^2(n))$ words of total communication.

Proof. Since Algorithm 2 is a direct adaptation of the centralized projection algorithm in [14], we omit the
proof of its correctness. Because we use a binary search over possible thresholds, the algorithm runs for at most
$O(\log n)$ iterations. Therefore, it suffices to show that the communication complexity of finding the median
is at most $O(k \log n)$. This can be done by the iterative procedure shown in Algorithm 3. Each entity first
sends its own median to the center. The center identifies the maximum and minimum local medians, denoted
by $\bar{m}$ and $\underline{m}$, respectively. The global median must be between $\underline{m}$ and $\bar{m}$, and removing the same number
of elements larger than or equal to $\bar{m}$ and less than $\underline{m}$ will not change the median. Therefore, the center
can notify the two corresponding entities and let them remove the same number of elements. At least one
entity will reduce its size by half, so the algorithm stops after $O(k \log n)$ iterations. Note that except for the
first round, we only need to communicate the updated medians of two entities in each round, so the overall
communication complexity is $O(k \log n)$ words.

In practice, it is often easier and more efficient to use a quickselect-based distributed algorithm to find
the median. The idea is to randomly select and broadcast a weight at each iteration. This, in expectation,
removes half of the possible median candidates. This approach achieves the same communication complexity
in expectation. □
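The quickselect-based alternative can be simulated in a single process as follows. This is a sketch under assumptions that go beyond the text: the pivot is drawn by first choosing an entity with probability proportional to its number of remaining candidates and then choosing one of that entity's candidates uniformly, and each entity answers a broadcast pivot with two counts. The function name `distributed_select` is hypothetical.

```python
import random


def distributed_select(parts, rank, rng):
    """Return the element of 0-based `rank` in the sorted union of the lists in `parts`."""
    lo, hi = float("-inf"), float("inf")    # remaining candidates lie strictly between lo and hi
    while True:
        cands = [[w for w in part if lo < w < hi] for part in parts]
        sizes = [len(c) for c in cands]     # one number per entity
        owner = rng.choices(range(len(parts)), weights=sizes)[0]
        pivot = rng.choice(cands[owner])    # the center broadcasts this weight
        # Each entity reports how many of its weights are below / equal to the pivot.
        below = sum(sum(1 for w in part if w < pivot) for part in parts)
        equal = sum(part.count(pivot) for part in parts)
        if rank < below:
            hi = pivot                      # target is smaller than the pivot
        elif rank < below + equal:
            return pivot                    # the pivot itself has the requested rank
        else:
            lo = pivot                      # target is larger than the pivot


# Usage: the (lower) median of nine weights spread over three entities.
parts = [[5, 1, 9], [3, 7], [2, 8, 6, 4]]
n = sum(len(p) for p in parts)
median = distributed_select(parts, (n - 1) // 2, random.Random(1))
```

Each round costs a constant number of words per entity, and a random pivot discards half of the candidates in expectation, which matches the expected communication stated above.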

The complete distributed agnostic boosting algorithm is shown in Algorithm 4. We summarize our
theoretical results in the next theorem.




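Algorithm 2 Distributed Bregman projection algorithm

Input:
    Each entity $i$: a disjoint subset $W_i$ of $W = \{w_1, \dots, w_n\}$
    Center: $n_0 = n$; $C = 0$; $C^w = 0$

while $n_0 \neq 0$ do
    Distributedly find the median $\theta$ of $(W_1, \dots, W_k)$
    Each entity $i$:
        $L_i = \{w : w < \theta, w \in W_i\}$; $M_i = \{w : w = \theta, w \in W_i\}$; $H_i = \{w : w > \theta, w \in W_i\}$
        $L_i^w = \sum_{w \in L_i} w$; $M_i^w = \sum_{w \in M_i} w$; $H_i^w = \sum_{w \in H_i} w$
    Center:
        $L = \sum_i |L_i|$; $M = \sum_i |M_i|$; $H = \sum_i |H_i|$
        $L^w = \sum_i L_i^w$; $M^w = \sum_i M_i^w$; $H^w = \sum_i H_i^w$
        $m_0 = \frac{1 - (C + H)/(\epsilon n)}{1 - (C^w + H^w)}$ and broadcasts it
        if $\theta m_0 > \frac{1}{\epsilon n}$ then
            $C = C + H + M$; $C^w = C^w + H^w + M^w$
            if $L = 0$ then $\theta = \max\{w : w < \theta, w \in W\}$
            set $n_0 = L$ and notify each entity $i$ to set $W_i = L_i$
        else
            set $n_0 = H$ and notify each entity $i$ to set $W_i = H_i$
        end if
end while

Center: $m_0 = \frac{1 - C/(\epsilon n)}{1 - C^w}$ and broadcasts it
Each entity $i$: set each coordinate as $w_i' = \frac{1}{\epsilon n}$ if $w_i > \theta$, and $w_i' = w_i m_0$ if $w_i \le \theta$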


Algorithm 3 Distributedly finding the median

Input:
    Each entity $i$: a disjoint subset $W_i$ of $W = \{w_1, \dots, w_n\}$

Each entity $i$: Send the median $m_i$ of $W_i$ to the center
while $|W_i| > 1$ for some $i \in [k]$ do
    Center: Find the maximum and minimum of the $k$ medians, denoted by $\bar{m}$ and $\underline{m}$, and
        notify the corresponding entities, denoted by $A$ and $B$.
    Entity $A$: Send $\bar{n} = |\{w : w \in W_A, w \ge \bar{m}\}|$ to the center
    Entity $B$: Send $\underline{n} = |\{w : w \in W_B, w < \underline{m}\}|$ to the center
    Center: Send $r = \min\{\bar{n}, \underline{n}\}$ to entities $A$ and $B$
    Entity $A$: Remove the largest $r$ elements in $W_A$
    Entity $B$: Remove the smallest $r$ elements in $W_B$
    Entities $A$ and $B$: Send the new median to the center
end while

Theorem 4. Given access to a $\beta$-weak agnostic learner, Algorithm 4 achieves error rate $\frac{2\,\mathrm{err}_D(H)}{1/2 - \beta} + \epsilon$
by using at most $O(\frac{\log(1/\epsilon)}{(1/2 - \beta)^2})$ rounds, each involving $O((d/\beta^2) \log(1/\beta))$ examples and an additional
$O(kd \log^2(n))$ words of communication per round, where $n$ is the size of the sample $S$ drawn in the initialization.

Proof. The boosting algorithm starts by drawing from $D$ a sample $S$ of size $n$ across
the $k$ entities without communicating them. If $S$ is a centralized dataset, then by Theorem 2 we know
that Algorithm 1 achieves error rate $\frac{2\,\mathrm{err}_S(H)}{1/2 - \beta} + \frac{\epsilon}{2}$ on $S$ using $O(\frac{\log(1/\epsilon)}{(1/2 - \beta)^2})$ iterations. We have shown
that Algorithm 4 is a correct simulation of Algorithm 1 in the distributed setting, and thus we achieve the
same error bound on $S$. The number of communication rounds is the same as the number of iterations
of the boosting algorithm. In each round, the communication includes $O((d/\beta^2) \log(1/\beta))$ examples
for finding the $\beta$-weak hypothesis, $O(kd)$ words for broadcasting the hypothesis and some numbers, and
$O(k \log^2(n))$ words for the distributed Bregman projection.

So far we only have the error bound of $\frac{2\,\mathrm{err}_S(H)}{1/2 - \beta} + \frac{\epsilon}{2}$ on $S$. To obtain the generalization error bound,
note that with $n$ sufficiently large and by the standard VC-dimension argument, we have that with high
probability $\mathrm{err}_S(H) \le \mathrm{err}_D(H) + \frac{(1/2 - \beta)\epsilon}{8}$, and the generalization error of our final hypothesis deviates
from the empirical error by at most $\epsilon/4$, which completes the proof with the desired generalization error
bound. □
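For concreteness, the arithmetic behind this last step can be written out as follows. This is a sketch in which the symbols $f$ (the final hypothesis) and $\eta$ (the uniform-convergence deviation between $\mathrm{err}_S(H)$ and $\mathrm{err}_D(H)$) are introduced here only for illustration.

```latex
\begin{align*}
\mathrm{err}_D(f)
  &\le \mathrm{err}_S(f) + \frac{\epsilon}{4}
   \le \frac{2\,\mathrm{err}_S(H)}{1/2-\beta} + \frac{\epsilon}{2} + \frac{\epsilon}{4} \\
  &\le \frac{2\,(\mathrm{err}_D(H) + \eta)}{1/2-\beta} + \frac{3\epsilon}{4}
   = \frac{2\,\mathrm{err}_D(H)}{1/2-\beta} + \frac{2\eta}{1/2-\beta} + \frac{3\epsilon}{4}.
\end{align*}
```

Choosing $\eta = (1/2 - \beta)\epsilon/8$ makes the middle term equal to $\epsilon/4$, so the three error terms add up to exactly $\epsilon$ and the bound of Theorem 4 follows.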

4 Experiments 

In this section, we compare the empirical performance of the proposed distributed boosting algorithm with
two other algorithms on synthetic and real-world datasets. The first one is distributed AdaBoost [1], which is
similar to our algorithm but without the projection step. The second one is the distributed logistic regression
algorithm available in the MPI implementation of the Liblinear package [27]. We choose it as a comparison
to a non-boosting approach. Note that Liblinear is a highly optimized package while our implementation
is not, so the comparison in terms of speed is not entirely fair. However, we show that our approach,
grounded in a rigorous framework, is comparable to this leading method in practice.

4.1 Experiment setup 

All three algorithms are implemented in C using MPI, and all the experiments are run on Amazon EC2 with
16 m3.large machines. The data is uniformly partitioned across the 16 machines. All the results are averaged
over 10 independent trials. Logistic regression is a deterministic algorithm, so we do not show the standard
deviation of its error rate. We nevertheless run it 10 times to get the average running time. Since
each algorithm has a different number of parameters, for fairness, we do not tune the parameters. For the two
boosting algorithms, we use $T = 100$ decision stumps as our weak learners and set $\beta = 0.2$ and $\epsilon = 0.1$ in
all experiments. For logistic regression, we use the default parameter $C = 1$.

Algorithm 4 Distributed agnostic boosting algorithm

Initialization:
    Center: Access to a $\beta$-weak agnostic learner. Set $\gamma = \frac{1}{2}(\frac{1}{2} - \beta)$
    Each entity $i$:
        Sample $S_i$ drawn from $D_i$ such that $S = \bigcup_i S_i$ has size $n$
        Set weights $w_{x,y}^{(1)} = 1/|S_i|$ for each $(x, y) \in S_i$

for $t = 1, 2, \dots, T$ do
    Each entity $i$: Send $w_i^{(t)} = \sum_{(x,y) \in S_i} w_{x,y}^{(t)}$ to the center
    Center: Let $W^{(t)} = \sum_i w_i^{(t)}$. Determine the number of examples $n_i^{(t)}$ to request from each
        entity $i$ by sampling $O(\frac{d}{\beta^2} \log \frac{1}{\beta})$ times from the multinomial distribution $w_i^{(t)} / W^{(t)}$,
        and then send each number $n_i^{(t)}$ to entity $i$.
    Each entity $i$: Sample $n_i^{(t)}$ times from $S_i$ proportionally to $w_{x,y}^{(t)}$ and send the examples to the center
    Center: Run the $\beta$-weak agnostic learner on the union of the received $O(\frac{d}{\beta^2} \log \frac{1}{\beta})$ examples,
        and then broadcast the returned hypothesis $h^{(t)}$
    Each entity $i$: Update the weight of each example $(x, y)$:
        $w_{x,y}^{(t+1)} = w_{x,y}^{(t)} (1 - \gamma)$ if $h^{(t)}(x) = y$, and $w_{x,y}^{(t+1)} = w_{x,y}^{(t)}$ otherwise
    Distributedly normalize and then project the weights by Algorithm 2
end for

Output: The hypothesis $\mathrm{sign}\left(\frac{1}{T} \sum_{t=1}^{T} h^{(t)}\right)$

4.2 Synthetic dataset 

We use the synthetic dataset from [22]. This dataset has an interesting theoretical property: although it
is linearly separable, randomly flipping a tiny fraction of labels makes all convex potential boosting algorithms,
including AdaBoost, fail to learn well. A random example is generated as follows. The label $y$ is randomly
chosen from $\{-1, +1\}$ with equal odds. The feature $x = (x_1, \dots, x_{21})$, where $x_i \in \{-1, +1\}$, is sampled
from a mixture distribution: 1) With probability 1/4, set all $x_i$ to be equal to $y$. 2) With probability 1/4,
set $x_1 = x_2 = \cdots = x_{11} = y$ and $x_{12} = x_{13} = \cdots = x_{21} = -y$. 3) With probability 1/2, randomly
set 5 coordinates from the first 11 and 6 coordinates from the last 10 to be equal to $y$. Set the remaining
coordinates to $-y$.
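The generation procedure can be sketched in Python as follows. The function name `sample_example` is hypothetical, and the noise model (flipping each label independently with probability `noise`) is an assumption of this illustration, since the text only says that a fraction of labels is flipped at random.

```python
import random


def sample_example(rng, noise=0.0):
    """Draw one (x, y) pair with x in {-1, +1}^21 and y in {-1, +1}."""
    y = rng.choice((-1, 1))
    u = rng.random()
    if u < 0.25:
        # Case 1: every feature agrees with the label.
        x = [y] * 21
    elif u < 0.5:
        # Case 2: the first 11 features agree, the last 10 disagree.
        x = [y] * 11 + [-y] * 10
    else:
        # Case 3: 5 of the first 11 and 6 of the last 10 features agree.
        agree = set(rng.sample(range(11), 5)) | set(rng.sample(range(11, 21), 6))
        x = [y if i in agree else -y for i in range(21)]
    if rng.random() < noise:
        y = -y                      # assumed noise model: independent label flip
    return x, y


# Usage: a small noise-free sample.
rng = random.Random(0)
data = [sample_example(rng) for _ in range(1000)]
```

In every case at least 11 of the 21 features agree with the clean label, so a simple majority vote over the features classifies the noise-free data perfectly, which is the linear separability mentioned above.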

We generate 1,600,000 examples in total for training on 16 machines and test on a separate set of size
100,000. The results are shown in Table 1. One can see that our approach (Dist.SmoothBoost) is more
resistant to noise than Dist.AdaBoost and significantly outperforms it for up to 1% noise. In the high
noise setting (10%), Liblinear performs poorly, while our approach achieves the best error rate.

Table 1: Average (over 10 trials) error rate (%) and standard deviation on the synthetic dataset

Noise    Dist.AdaBoost    Dist.SmoothBoost    Liblinear-LR
0.1%     11.64 ± 3.82     4.28 ± 0.66         0.00
1%       25.97 ± 1.56     13.38 ± 4.66        0.00
10%      28.04 ± 0.94     27.07 ± 1.60        37.67

Table 2: Average (over 10 trials) error rate (%) and standard deviation on real-world datasets

Dataset    # examples    # features    Dist.AdaBoost    Dist.SmoothBoost    Liblinear-LR
Adult      48,842        123           15.71 ± 0.16     15.07 ± 2.32        15.36
Ijcnn1     141,691       22            5.90 ± 0.10      4.33 ± 0.18         7.57
Cod-RNA    488,565       8             6.12 ± 0.09      6.51 ± 0.11         11.79
Covtype    581,012       54            24.98 ± 0.22     24.68 ± 0.30        24.52
Yahoo      3,251,378     10            37.08 ± 0.15     36.86 ± 0.27        39.15

4.3 Real-world datasets 

We run the experiments on 5 real-world datasets with sizes ranging from 50 thousand to over 3 million examples:
Adult, Ijcnn1, Cod-RNA, and Covtype from the LibSVM data repository²; Yahoo from the Yahoo!
WebScope dataset 0. The Yahoo dataset is used for predicting whether a user will click the news article 
on their front page. It contains user click logs and is extremely imbalanced. We trim down this dataset so 
that the numbers of positive and negative examples are the same. Detailed information about the datasets is
summarized in Table 2. Each dataset is randomly split into 4/5 for the training set and 1/5 for the testing set.

The average error rate and the total running time are summarized in Table 2 and Table 3, respectively.
The bold entries indicate the best error rate. Our approach outperforms the other two on 3 datasets and
performs competitively on the other 2 datasets. In terms of running time, Liblinear is the fastest on all 
datasets. However, the communication of our algorithm only depends on the dimension d, so even for 
the largest dataset (Yahoo), it can still finish within 4 seconds. Therefore, our algorithm is suitable for 
many real-world situations where the number of examples is much larger than the dimension of the data. 
Furthermore, our algorithm can be used with more advanced weak learners, such as distributed logistic 
regression, to further reduce the running time. 

Table 3: Average run time (sec) on real-world datasets

Dataset    Dist.AdaBoost    Dist.SmoothBoost    Liblinear-LR
Adult      5.02             15.54               0.06
Ijcnn1     0.76             9.19                0.10
Cod-RNA    1.08             10.11               0.12
Covtype    3.71             6.48                0.31
Yahoo      3.37             3.79                1.37

² http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

5 Conclusions

We propose the first distributed boosting algorithm that enjoys strong performance guarantees, being
simultaneously noise tolerant, communication efficient, and computationally efficient; furthermore, it is quite
flexible in that it can be used with a variety of weak learners. This improves over the prior works [1, 8], which
were either communication efficient only in noise-free scenarios or computationally prohibitive. While
enjoying nice theoretical guarantees, our algorithm also shows promising empirical results on large synthetic
and real-world datasets.

Finally, we raise some related open questions. In this work we assumed a star topology, i.e., the center 
can communicate with all players directly. An interesting open question is to extend our results to general 
communication topologies. Another concrete open question is reducing the constant in our error bound 
while maintaining good communication complexity. Finally, our approach uses centralized weak learners 
for learning general concept classes, so the computation is mostly done in the center. Are there efficient
distributed weak learners for some specific concept classes? That could provide a more computation-balanced
distributed learning procedure that enjoys strong communication complexity as well.

A Proof of Theorem 1

Theorem 1. Given a sample $S$ and access to a $\gamma$-weak learner, Algorithm 1 makes at most $T = O(\frac{\log(1/\epsilon)}{\gamma^2})$
calls to the weak learner with $\epsilon$-smooth distributions and achieves error rate $\epsilon$ on $S$.

Proof. The analysis is based on the well-studied online learning from experts problem. In each round $t$,
the learner has to make a decision based on the advice of $n$ experts. More precisely, the learner chooses
a distribution $D^{(t)}$ from a convex feasible set $V$ and follows the advice of the $i$-th expert with probability
$D^{(t)}(i)$. Then, the losses of each expert's suggested actions are revealed as a vector $\ell^{(t)}$. The expected loss of
the learner incurred by using $D^{(t)}$ is thus $\ell^{(t)} \cdot D^{(t)}$. The goal is to achieve a total expected loss
$\sum_{t=1}^{T} \ell^{(t)} \cdot D^{(t)}$ not too much more than $\min_{D \in V} \sum_{t=1}^{T} \ell^{(t)} \cdot D$, the cost of always using the best fixed
distribution in $V$.

Steps 2 and 3 of Algorithm 1, which together are also known as the multiplicative weights update algorithm,
have the following regret bound [14].

Lemma 1. For any $0 < \gamma \le 1/2$ and any positive integer $T$, the multiplicative weights update algorithm
generates distributions $D^{(1)}, \dots, D^{(T)} \in V$, where each $D^{(t)}$ is computed only based on $\ell^{(1)}, \dots, \ell^{(t-1)}$,
such that for any $D \in V$,

$\sum_{t=1}^{T} \ell^{(t)} \cdot D^{(t)} \le (1 + \gamma) \sum_{t=1}^{T} \ell^{(t)} \cdot D + \frac{\mathrm{RE}(D \,\|\, D^{(1)})}{\gamma}$,

where, for two distributions $p$ and $q$, the relative entropy is $\mathrm{RE}(p \,\|\, q) = \sum_i p_i \ln(p_i / q_i)$.

To use the above result in boosting, we can think of the $n$ examples in the sample $S$ as the set of experts.
The learner's task is thus to choose a distribution $D^{(t)} \in V$ over the sample at each round. The loss $\ell_i^{(t)}$ is
defined to be $\mathbf{1}[h^{(t)}(x_i) = y_i]$, where $h^{(t)}$ is the hypothesis returned by the weak learner. To ensure that the
boosting algorithm always generates a “smooth” distribution, we set the feasible set $V$ to be the set of all
$\epsilon$-smooth distributions. Below we show how this can be applied in boosting, as suggested by [18].

By the assumption of the $\gamma$-weak learner, we have

$\ell^{(t)} \cdot D^{(t)} = \sum_i D^{(t)}(i)\,\mathbf{1}[h^{(t)}(x_i) = y_i] \ge 1/2 + \gamma$.

After $T = \lceil \frac{2 \ln(1/\epsilon)}{\gamma^2} \rceil + 1$ rounds, we set the final hypothesis $f = \mathrm{sign}(\frac{1}{T} \sum_{t=1}^{T} h^{(t)})$. Let $E \subseteq S$ be the set
of examples where $f$ predicts incorrectly. Suppose $|E| \ge \epsilon n$. Let $D = u_E$, the uniform distribution on $E$
and 0 elsewhere. It is easy to see that $u_E \in V$, since $|E| \ge \epsilon n$. For each example $(x_i, y_i) \in E$, we have

$\sum_{t=1}^{T} \ell_i^{(t)} = \sum_{t=1}^{T} \mathbf{1}[h^{(t)}(x_i) = y_i] \le T/2$,

since $f$ misclassifies $(x_i, y_i)$. Therefore, $\sum_{t=1}^{T} \ell^{(t)} \cdot u_E \le T/2$. Furthermore, since $|E| \ge \epsilon n$, we have

$\mathrm{RE}(D \,\|\, D^{(1)}) = \mathrm{RE}(u_E \,\|\, D^{(1)}) \le \ln(1/\epsilon)$.

By plugging these facts into the inequality in Lemma 1, we get

$(1/2 + \gamma)\,T \le (1 + \gamma)\,T/2 + \frac{\ln(1/\epsilon)}{\gamma}$,

which implies $T \le \frac{2 \ln(1/\epsilon)}{\gamma^2}$, a contradiction. □